Learning a Machine for the Decision in a Partially 
Observable Markov Universe 

Frederic Dambreville 
Délégation Générale pour l'Armement, DGA/CTA/DT/GIP 
16 Bis, Avenue Prieur de la Côte d'Or 
F 94114, France 
Email: Frederic.DAMBREVILLE@dga.defense.gouv.fr 

February 1, 2008 
Abstract 

In this paper, we are interested in optimal decisions in a partially 
observable Markov universe. Our viewpoint departs from the dynamic 
programming viewpoint: we are directly approximating an optimal 
strategic tree depending on the observation. This approximation is 
made by means of a parameterized probabilistic law. In this paper, 
a particular family of hidden Markov models, with input and output, 
is considered as a learning framework. A method for optimizing the 
parameters of these HMMs is proposed and applied. This optimization 
method is based on the cross-entropic principle. 

Keywords: Control, MDP/POMDP, Hierarchical HMM, Bayesian Networks, 
Cross-Entropy 

Notations. Some specific notations are used in this document. 

• The variables x, y, z and m are used for the action, observation, world 
state and machine memory, 

• The time t runs from stage 1 to the maximal stage T. Variables 
with subscript outside this scope are synonymous to $\emptyset$. For 
example, $\prod_{t=1}^{T}\pi(z_t|z_{t-1})$ means 
$\pi(z_1|\emptyset)\prod_{t=2}^{T}\pi(z_t|z_{t-1})$, a Markov chain. 
A similar principle is used for the level superscript $\lambda$ in the 
definition of hierarchical HMM, 

• Braces used with subscripts, eg. $\{\cdot\}_{k=1}^{3}$, only have a 
grammatical meaning. More precisely, it means that the symbols inside the 
braces are duplicated and concatenated according to the subscript. For 
example, $\{f_k(\}_{k=1}^{3}\,x\,\{)\}_{k=1}^{3}$ means $f_1(f_2(f_3(x)))$ 
and $\{x_k\}_{k=t}^{T}$ means $\prod_{k=t}^{T}x_k$, 

• The generic notation for a probability is P. However, the functions 
p, $\pi$ and h denote some specific components of the probability. p is 
the law of the observation y and state z conditionally to the action 
x. $\pi$ is a stochastic policy, ie. a law of the action conditionally to 
the observation. h is an approximation of $\pi$ by a HMM family. The 
hidden state of h is defined as the machine memory m. 

1 Introduction 

There are different degrees of difficulty in planning and control problems. 
In most problems, the planner has to start from a given state and terminate 
in a required final state. Several transition rules condition the sequence 
of decisions. For example, a robot may be required to move from room A, the 
starting state, to room B, the final state; its actions could be go 
forward, turn right or turn left, and it cannot cross a wall; these are 
the conditions over the decision. A first degree of difficulty is to find 
at least one solution for the planning. When the states are only partially 
known or the actions are not deterministic, the difficulty increases 
markedly: the planner has to take into account the various observations. 
The problem becomes much more complex when the planning is required to be 
optimal or near-optimal, for example when the shortest trajectory moving 
the robot from room A to room B has to be found. There are again different 
degrees of difficulty, depending on whether the problem is deterministic or 
not, and on the model of the future observations. In the particular case 
of a Markovian problem with the full observation hypothesis, the dynamic 
programming principle can be applied efficiently (Markov Decision Process 
theory/MDP). This solution has been extended to the case of partial 
observation (Partially Observable Markov Decision Process/POMDP), but 
it is generally not practicable, owing to the huge dimension of the 
variables. 

In this paper, we are interested in optimal planning with partial 
observation. A Markovian hypothesis is made. Our viewpoint departs from 
the dynamic programming viewpoint: we are directly approximating an optimal 
strategic tree depending on the observation. This approximation is made by 
means of a parameterized conditional probabilistic law, ie. actions 
depending on observations. The main problems are the choice of the family 
of parameterized laws and the learning of the optimal parameters. A 
particular family of (hierarchical) hidden Markov models, with input and 
output, has been used as the learning framework. A method for optimizing 
the parameters of these HHMMs is proposed and applied. This optimization 
method is based on the cross-entropic principle. The resulting 
parameterized law could be seen as a Virtual Machine which could generate 
near-optimal dynamic strategies for any drawing of the problem. 

The next section introduces some formalism and gives a quick description of 
the MDP/POMDP problems. It is recalled that the Dynamic Programming 
method could solve these problems, but that this solution is intractable for 
an actual POMDP problem. A new near-optimal planning method is then 
proposed, based on the direct approximation of the optimal decision tree. 
The third section introduces the family of Hierarchical Hidden Markov 
Models. A particular sub-family of HHMMs is proposed as a candidate for the 
approximation of decision trees. The fourth section describes the method 
for optimizing the parameters of the HHMM, in order to approximate the 
optimal decision tree for the POMDP problem. The cross-entropy method 
is described and applied. The fifth section gives an example of application. 
The paper is then concluded. 

2 MDP and POMDP 

It is assumed that a subject is acting in a given world with a given purpose 
or mission. The goal is to optimize the accomplishment of this mission. As a 
framework for the optimization, a prior model is hypothesized for the world 
(world state and observation depending on the action of the subject) and 
for the mission evaluation (a function depending on world state, observation 
and action). In the next paragraphs, a Markovian modelling of the world 
and a recursive modelling of the evaluation are considered. 

Figure 1: A Markovian World 

The world. Our modelling of the world is based on the MDP/POMDP 
formalism, ie. (Partially Observable) Markov Decision Process. In the MDP 
or POMDP formalism, the world is considered as a Markov chain conditioned 
by the past action of the subject. Assume that the world is characterized 
by a state variable z and that the action is described by a variable 
x. Assume that the time is starting from 1, and denote respectively $z_t$ 
and $x_t$ the world state and the action for the time t. The law of z with 
respect to the past actions x thus verifies the Markov property: 

$$P(z_t|z_{1:t-1},x_{1:t-1}) = p(z_t|z_{t-1},x_{t-1}) ,$$

where the notation $z_{1:t}$ means $z_1,\dots,z_t$, and the notation $z_0$, 
ie. before starting time, means $\emptyset$. This Markov chain is thus 
characterized by two functions: an initial law $p(z_1|\emptyset)$ and a 
transition law $p(z_t|z_{t-1},x_{t-1})$ for $t \ge 2$. This distinction is 
often omitted in this paper, but it is implied. The law of z|x is 
represented graphically by the Bayesian Network of figure 1. In this 
description, the arrows indicate the dependency between variables. For 
example, $z_t \rightarrow z_{t+1} \leftarrow x_t$ means that the law of 
$z_{t+1}$ is defined conditionally to the past variables $x_t$ and $z_t$. 
Such Bayesian Network representations will be used several times in this 
paper for their pedagogical value. 

The observation. MDP and POMDP assume that the world is observed 
during the process. 

In MDP, the world is fully observed; the observation at time t is the world 
state $z_t$. 

In POMDP, the world is partially observed; the observation at time t is 
denoted $y_t$. The variable of observation $y_t$ is only depending on the 
current world state $z_t$: 

$$P(y_t|z_{1:t},y_{1:t-1},x_{1:t}) = p(y_t|z_t) .$$

The variables y, z|x constitute a Hidden Markov Model (with control). This 
HMM is represented in figure 2. 

Figure 2: A Hidden Markov Model (with control) 
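
As an illustration of this controlled HMM, the following minimal Python 
sketch draws one trajectory (x, y, z). It is not taken from the paper: the 
function name, the dictionary encoding of the laws and the `policy` 
callback are assumptions made for the example. 

```python
import random

def sample_controlled_hmm(p_init, p_trans, p_obs, policy, T, rng):
    """Draw one trajectory (x, y, z) of length T from a controlled HMM.

    p_init[z]            : initial law p(z_1 | empty)
    p_trans[(z, x)][z2]  : transition law p(z_t = z2 | z_{t-1} = z, x_{t-1} = x)
    p_obs[z][y]          : observation law p(y_t = y | z_t = z)
    policy(past_obs)     : hypothetical callback giving x_t from y_{1:t-1}
    """
    def draw(law):
        # Sample a key of the dict `law` with probability law[key].
        keys = list(law)
        return rng.choices(keys, weights=[law[k] for k in keys])[0]

    xs, ys, zs = [], [], []
    for t in range(T):
        x = policy(tuple(ys))  # x_t only depends on y_{1:t-1}
        z = draw(p_init) if t == 0 else draw(p_trans[(zs[-1], xs[-1])])
        y = draw(p_obs[z])     # y_t only depends on z_t
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs
```

The action at stage t is chosen before the state and the observation of 
stage t are drawn, which mirrors the ordering $x_t(y_{1:t-1})$ used in the 
optimization problems of this section. 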

Evaluation and optimal planning. The previous paragraphs have built 
a model of the world, of the actions and of the observations. We 
are now giving a characterization of the mission to be accomplished. The 
mission is limited in time. Let T be this maximum time. In full 
generality, the mission is evaluated by a function 
$V(x_{1:T},y_{1:T},z_{1:T})$ defined on the trajectories of x, y, z 
($V(x_{1:T},z_{1:T})$ in the case of a MDP). Typically, the function V 
could be used for computing the time needed for the mission 
accomplishment. The purpose is to construct an optimal decision tree 
x(obs) depending on the observation obs in order to maximize the mean 
evaluation. This is a dynamic optimization problem, since the actions 
depend on the previous observations. This problem is expressed differently 
for MDP and POMDP: 

MDP. $obs_t = z_t$. Optimize $x_t(z_{1:t-1})|_{1\le t\le T}$. 

$$\text{Find } x_o \in \arg\max_{x} \sum_{z_{1:T}} V\big(x_t(z_{1:t-1})|_{1\le t\le T}, z_{1:T}\big) \prod_{t=1}^{T} p\big(z_t|z_{t-1},x_{t-1}(z_{1:t-2})\big) .$$

POMDP. $obs_t = y_t$. Optimize $x_t(y_{1:t-1})|_{1\le t\le T}$. 

$$\text{Find } x_o \in \arg\max_{x} \sum_{y_{1:T}}\sum_{z_{1:T}} V\big(x_t(y_{1:t-1})|_{1\le t\le T}, y_{1:T}, z_{1:T}\big) \prod_{t=1}^{T} p(y_t|z_t)\,p\big(z_t|z_{t-1},x_{t-1}(y_{1:t-2})\big) . \qquad (1)$$

Figure 3: MDP planning 

Figure 4: POMDP planning 

These optimizations are schematized in figures 3 and 4. In these figures, 
the double arrows $\Rightarrow$ characterize the variables to be optimized. 
More precisely, these arrows describe the flow of information between the 
observations and the actions. The cells denoted $\infty$ are making 
decisions and transmitting all the received and generated information 
(including the actions). This architecture illustrates that planning with 
observation is an in(de)finite-memory problem: the decision depends on the 
whole past observations. But in the particular case of MDP, the problem is 
finite memory: the dashed arrows of figure 3 may be removed, resulting in 
the finite memory BN of figure 5 (memory of the past decision is 
maintained). 

Figure 5: MDP planning with finite memory 

This "finiteness" is an ingredient which makes the MDP problems rather 
easily solvable by means of the dynamic programming method. DP methods 
are introduced in section 2.1. As a result, DP produces the root $x_{o,1}$ 
of the optimal decision tree, which is a sufficient tool for the decision 
in an actual open-loop planning process. More precisely, the whole process 
for optimizing the actual mission takes the form: 

1. Initialize the MDP/POMDP priors of the mission, 

2. Compute the theoretical optimal decision root $x_{o,1}$, 

3. The decision $x_{o,1}$ is applied to the actual world, 

4. Make actual observations and update the prior; in particular, the time 
is forwarded by 1, 

5. Repeat from step 2 until accomplishment of the mission. 

2.1 Dynamic Programming method 

This presentation is extremely simplified. More detailed references are 
available in the cited literature. The Dynamic Programming method works for 
an additive evaluation $V(x_{1:T},z_{1:T}) = \sum_{t=1}^{T} V_t(x_t,z_t)$. 
In this simplified presentation, a final evaluation is assumed, ie. 
$V(x_{1:T},z_{1:T}) = V_T(x_T,z_T)$, but the principle is the same. 

The MDP case. To be computed: 

$$x_{o,1} \in \arg\max_{x_1}\max_{x_{2:T}} \sum_{z_{1:T}} V_T\big(x_T(z_{1:T-1}),z_T\big) \prod_{t=1}^{T} p\big(z_t|z_{t-1},x_{t-1}(z_{1:t-2})\big) .$$

This computation is easily refined: 

$$x_{o,1} \in \arg\Big\{\max_{x_t}\sum_{z_t} p(z_t|z_{t-1},x_{t-1})\Big\}_{t=1}^{T} V_T(x_T,z_T) . \qquad (2)$$

This factorization is deeply related to the finite memory property of the 
MDP. More precisely, the "optimal" evaluation for stage t, denoted $W_t$, 
may be computed recursively, as a direct result of (2): 

$$W_T(x_T,z_T) = V_T(x_T,z_T) , \qquad W_{t-1}(x_{t-1},z_{t-1}) = \max_{x_t}\sum_{z_t} p(z_t|z_{t-1},x_{t-1})\,W_t(x_t,z_t) .$$

Associated to these evaluations are defined the optimal strategies 
$x_{o,t}$ for the stage t: 

$$x_{o,t}(x_{t-1},z_{t-1}) \in \arg\max_{x_t}\sum_{z_t} p(z_t|z_{t-1},x_{t-1})\,W_t(x_t,z_t) .$$

In particular: $x_{o,1} \in \arg\max_{x_1}\sum_{z_1} p(z_1|\emptyset)\,W_1(x_1,z_1)$. 

It happens that the optimal strategy for stage t only depends on the 
previous action $x_{t-1}$ and state $z_{t-1}$. This is exactly what was 
intuited from figure 5 as a finite memory property. 

This method, known as dynamic programming, thus relies on the computation 
of backward functions defined over the variables $x_t, z_t$. The DP 
methods have been extended to POMDP problems, but this time the recursive 
functions are much more intricate and almost intractable. 
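
The backward recursion on $W_t$ can be sketched in a few lines of Python 
for a finite MDP with a final evaluation. This is only an illustrative 
sketch: the function name and the callables `p_init(z)`, 
`p_trans(z2, z, x)` and `v_final(x, z)` are assumptions introduced for the 
example, not part of the paper. 

```python
def mdp_backward_induction(states, actions, p_init, p_trans, v_final, T):
    """Return (x_o1, mean evaluation) for a final evaluation V_T(x_T, z_T).

    p_init(z)         : initial law p(z_1 | empty)
    p_trans(z2, z, x) : transition law p(z_t = z2 | z_{t-1} = z, x_{t-1} = x)
    v_final(x, z)     : final evaluation V_T(x_T, z_T)
    """
    # W[(x, z)] stores W_t(x_t, z_t); the recursion starts from W_T = V_T.
    W = {(x, z): v_final(x, z) for x in actions for z in states}
    for _ in range(T - 1):  # builds W_{T-1}, ..., W_1
        W = {
            (x, z): max(
                sum(p_trans(z2, z, x) * W[(x2, z2)] for z2 in states)
                for x2 in actions
            )
            for x in actions
            for z in states
        }
    # Root of the decision tree: maximize the mean of W_1 over z_1.
    score = {x: sum(p_init(z) * W[(x, z)] for z in states) for x in actions}
    best = max(actions, key=lambda x: score[x])
    return best, score[best]
```

Each backward step costs on the order of (number of states × number of 
actions)² operations, which is what makes the MDP case tractable. 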

The POMDP case. To be computed: 

$$x_{o,1} \in \arg\max_{x_1}\max_{x_{2:T}} \sum_{y_{1:T}}\sum_{z_{1:T}} V_T\big(x_T(y_{1:T-1}),y_T,z_T\big) \prod_{t=1}^{T} p(y_t|z_t)\,p\big(z_t|z_{t-1},x_{t-1}(y_{1:t-2})\big) .$$

The previous factorization works partially: 

$$x_{o,1} \in \arg\Big\{\max_{x_t}\sum_{y_t}\Big\}_{t=1}^{T} \sum_{z_{1:T}} V_T(x_T,y_T,z_T) \prod_{t=1}^{T} p(y_t|z_t)\,p(z_t|z_{t-1},x_{t-1}) .$$

But this factorization is incomplete and the previous recursion is not 
possible anymore. Still, there is an answer to this problem by means of a 
dynamic programming method. But the recursion cannot rely on the variables 
$x_t, y_t$ (last action and observation) only: these variables contain 
insufficient information to predict the future. It is necessary to 
transmit the probabilistic belief over $z_t$ estimated from the whole past 
actions and observations. Denote $b_t(z_t)$ the belief over $z_t$. The 
solution of the POMDP is constructed recursively: 

$$W_T(b_T) = \max_{x_T}\sum_{y_T}\sum_{z_T} b_T(z_T)\,p(y_T|z_T)\,V_T(x_T,y_T,z_T) ;$$
$$\beta_{t+1}(z_{t+1}|x_t,y_t,b_t) = \sum_{z_t} b_t(z_t)\,p(y_t|z_t)\,p(z_{t+1}|z_t,x_t) ,$$
$$W_t(b_t) = \max_{x_t}\sum_{y_t} W_{t+1}\big(\beta_{t+1}(\cdot|x_t,y_t,b_t)\big) ;$$
$$\beta_2(z_2|x_1,y_1) = \sum_{z_1} p(z_1|\emptyset)\,p(y_1|z_1)\,p(z_2|x_1,z_1) ,$$
$$x_{o,1} \in \arg\max_{x_1}\sum_{y_1} W_2\big(\beta_2(\cdot|x_1,y_1)\big) .$$

This recursive construction of $W_t$ is very difficult, since the variable 
$b_t$ is continuous and has a very high dimension. For example, if the 
state $z_t$ is the position of a target in a discrete space of 20 × 20 
cells (the same space as in the example¹ of section 5), this dimension is 
399. Of course, there are many possible refinements and approximations, 
but even so, the DP method becomes easily intractable. 
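
The belief propagation $\beta_{t+1}$ used in this recursion is simple to 
write down for a finite state space; the difficulty lies in the fact that 
$W_t$ is a function of the whole vector $b_t$. The following Python sketch 
shows the (unnormalized) propagation step only; the function name and the 
callables `p_obs(y, z)` and `p_trans(z2, z, x)` are assumptions made for 
the example. 

```python
def propagate_belief(b, x, y, states, p_obs, p_trans):
    """Unnormalized belief update.

    Computes beta(z2) = sum_z b(z) * p(y | z) * p(z2 | z, x), where
    b is a dict mapping each state z to its current weight b_t(z).
    """
    return {
        z2: sum(b[z] * p_obs(y, z) * p_trans(z2, z, x) for z in states)
        for z2 in states
    }
```

Since the propagated weights are not renormalized, summing them over all 
states and all observations y gives back the total mass of b. This is 
consistent with the plain sum over $y_t$ in the definition of $W_t(b_t)$. 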

This particular difficulty for solving POMDP problems was illustrated by 
the figures 3 and 4: MDPs are fundamentally more simple than POMDPs. 
Indeed, since the world state z is a Markov chain, the knowledge of $z_t$ 
together with the last action $x_t$ is sufficient to predict all the future 
(figure 5). On the other hand, the knowledge of $y_t$ and of $x_t$ is not 
sufficient to predict the future of the Markov chain z. 

¹In this example the dimension is even more, ie. $4^2 \times 400^3 - 1$. 

2.2 Direct approximation of the decision tree 

While the classical DP method generally fails to solve the POMDP 
optimization, some authors have investigated approximations of the 
solution. Our work investigates the direct approximation of the decision 
tree. 

In an optimization problem like (1), the value to be optimized, $x_o$, is a 
deterministic object. In this precise case, $x_o$ is a tree of decision, 
that is a function which maps any sequence of observations $y_{1:t-1}$ to a 
decision $x_t$. It is possible however to have a probabilistic viewpoint.² 
Then the problem is equivalent to finding $(x,y) \mapsto \pi(x|y)$, a 
probabilistic law of actions conditionally to the past observations, which 
maximizes the mean evaluation: 

$$V(\pi) = \sum_{x_{1:T}}\sum_{y_{1:T}}\sum_{z_{1:T}} \prod_{t=1}^{T}\pi(x_t|x_{1:t-1},y_{1:t-1}) \times P(y_{1:T},z_{1:T}|x_{1:T})\,V(x_{1:T},y_{1:T},z_{1:T}) ,$$

where: 

$$P(y_{1:T},z_{1:T}|x_{1:T}) = \prod_{t=1}^{T} p(y_t|z_t)\,p(z_t|z_{t-1},x_{t-1}) .$$

If the solution is optimal, there will not be a great difference with the 
deterministic case: when the solution $x_o$ is unique, the optimal law 
$\pi_o$ is a Dirac on $x_o$. But things change when approximating $\pi$. 
Now, why use a probability to approximate the optimal strategy? The main 
point is that probabilistic models seem more suitable for approximation 
than, for example, deterministic decision trees. The second point is that 
we are sure to approximate continuously: indeed, $\pi \mapsto V(\pi)$ is 
continuous. There is a third point, which is more "philosophical". 
Considering figure 4, it appears clearly that it is symmetric: the arrows 
$\Rightarrow$ and $\rightarrow$ play a quite similar role. In the 
deterministic viewpoint, this is just a coincidence. But if we are 
considering probabilistic strategies, the arrows $\Rightarrow$ in the 
Bayesian Network of figure 4 constitute a HMM controlled by the 
observation, and with infinite memory. Of course, a natural approximation 
of a HMM with infinite memory is a HMM with finite memory! Then, replacing 
the infinite-memory states $\infty$ by a finite-memory variable m, a 
perfectly symmetric problem is obtained in figure 6; perhaps some kind of 
duality relation. 

Figure 6: Finite-memory planning approximation (world states $z_t$, 
actions $x_t$ and observations $y_t$, machine memories $m_t$) 

²Such a probabilistic viewpoint is more common, in fact necessary, in game 
problems. 

Aestheticism is not the only interest of such an approximation. The most 
interesting point is that many methods apply to HMMs, particularly for 
learning and optimizing parameters. The approach developed in this paper 
is thus quite general and can be split up into two points: 

• Define a family of parameterized HMMs $\mathcal{H}$, 

• Optimize the parameters of the HMM in order to maximize the mean 
evaluation: 

$$\text{Find } h_o \in \arg\max_{h\in\mathcal{H}} V(h) .$$

The first point is discussed in the next section, where a sub-class of 
hierarchical HMMs is defined. The second point is discussed in the 
following section, which explains a cross-entropic method for optimizing 
the parameters. 

3 Hierarchical Hidden Markov Model 

In this work, a significant investment was dedicated to the construction of 
a "good" parameterized probabilistic family. In order to investigate future 
complex problems, our choice has focused on hierarchical HMM models. These 
models are inspired from biology: to solve a complex problem, factorize it 
and make decisions in a hierarchical fashion. Low hierarchies manipulate 
low level information and actions, making short-term decisions. High 
hierarchies manipulate high level information and actions (uncertainty is 
less), making long-term decisions. 

The implementation of such hierarchical models has not been completely 
investigated yet; a rather simple HHMM has been implemented, and our 
example tests needed just a few hierarchies. Still, this section 
deliberately discusses HHMMs from a broader perspective. 

3.1 Generality 

This section gives some general references about HHMMs without control. 
However, HHMMs have already been used in control problems. In [3] a HHMM 
modelling of the world is used in a POMDP problem, but the control (ie. 
the action) is not propagated within the HHMM. In our models, actions and 
observations are both propagated within the same HHMM structure. Thus, 
to the author's knowledge, the model of controlled HHMM defined here is 
not related to any existing work. But it is interesting to introduce this 
section with some results about uncontrolled HHMMs. 

A hierarchical hidden Markov model (HHMM) may be defined as a HMM 
whose output is either a hierarchical HMM or an actual output. A HHMM 
could also be considered as a hierarchy of stochastic processes calling 
sub-processes. It is noteworthy that the length of a sub-process call 
depends on the sub-process law. This length is variable. 

Simple HMMs have been variously applied, for example in speech or 
handwriting recognition, in robotics, etc. However there are still few 
applications of hierarchical HMMs, although it is probable that HHMMs 
should allow a more abstract representation of the processes. 

In their work, Fine, Singer and Tishby applied HHMM to identify 
combinations of letters in handwriting. The main difficulty was to derive 
appropriate Viterbi and Baum-Welch algorithms. The complexity of the 
methods of Fine et al. was about $T^3 Q^{\Lambda}$, where T is the length 
of the samples, $\Lambda$ is the number of levels of the HHMM tree and Q is 
the number of states per level of the HHMM. Such computation times are 
obviously difficult to afford; for most samples T is too big. 

In their work [7], Murphy and Paskin have shown how HHMM could be 
interpreted as a particular two-dimensional dynamic Bayesian Network of 
size $\Lambda \times T$, and derived algorithms of complexity 
$TQ^{2\Lambda}$. This is much better, but remains exponential with 
$\Lambda$. The number of levels is thus strongly limited. Although Murphy 
and Paskin did not express this property explicitly, it is easy to show 
that their Bayesian Network has the Markov property both in time and in 
level. In fact, a hierarchical HMM could be interpreted by a Bayesian 
Network as described in figure 7 (refer to the appendix). A hierarchical 
HMM is thus clearly a HMM with vectorial states. However, these states 
should contain some boolean components which are used to define the 
sub-process call and return [3] (also refer to the appendix). Moreover, 
there is a downward and upward exchange of information between the levels, 
in a hierarchical HMM. 

Figure 7: Model of a Hierarchical HMM 
(legend: • = information, ◦ = information+boolean, y = output) 

The next section introduces a model of controlled HHMM, which is (freely) 
inspired from the BN of figure 7. 

3.2 Definition of a controlled model 

A natural extension of the BN of figure 7 is obtained by adding an input 
door to each upward column of the BN (then completing it), as described in 
figure 8; moreover, boolean components are then needed in all hidden cells, 
so that the whole memory is required to be discrete or semi-continuous. 
Then, an alternation of output and input is obtained, which exactly matches 
our problem. This is probably a very good model and future works should 
investigate this solution. But for historical reasons in our work, 
another, similar model has been chosen for now. 

Figure 8: Model of a controlled Hierarchical HMM 

In the HHMM family which has been implemented, each memory state receives 
an information from the current upper-level state and the previous 
lower-level state. This model guarantees a propagation of the information 
between the several levels of the hierarchy, but this propagation is slower 
than in the case of figure 8. More precisely, this HHMM family being 
denoted $\mathcal{H}$, any HHMM $h \in \mathcal{H}$ takes the form: 

$$h(x|y) = \sum_{m_{1:T}^{1:\Lambda}\in M^{1:\Lambda}} \prod_{t=1}^{T}\Big( h^0(x_t|m_t^1)\,h^1(m_t^1|y_{t-1},m_t^2) \prod_{\lambda=2}^{\Lambda} h^\lambda(m_t^\lambda|m_{t-1}^{\lambda-1},m_t^{\lambda+1})\Big) ,$$

where $m^\lambda$ is the variable, or memory, for the hidden level 
$\lambda$ of the HHMM, $M^\lambda$ is the set of possible states for the 
variable $m^\lambda$, and $\Lambda$ is the number of levels of the HHMM. 
The law h is described graphically in figure 9. It is noteworthy that this 
model is equivalent to a simple HMM when $\Lambda = 2$. And when 
$\Lambda = 1$, the law h just maps the immediate observation to action, 
without any memory of the past observations. 

Figure 9: HHMM model for the planning 

For any $h \in \mathcal{H}$, define P[h] the complete probabilistic law of 
the system Universe/Planner: 

$$P[h](x,y,z,m) = P(y_{1:T},z_{1:T}|x_{1:T}) \prod_{t=1}^{T}\Big( h^0(x_t|m_t^1)\,h^1(m_t^1|y_{t-1},m_t^2) \prod_{\lambda=2}^{\Lambda} h^\lambda(m_t^\lambda|m_{t-1}^{\lambda-1},m_t^{\lambda+1})\Big) .$$

Our approximated planning problem reduces to finding the near-optimal 
strategy $h_o \in \mathcal{H}$ such that: 

$$h_o \in \arg\max_{h\in\mathcal{H}} \sum_{x_{1:T}}\sum_{y_{1:T}}\sum_{z_{1:T}}\sum_{m_{1:T}^{1:\Lambda}} P[h](x,y,z,m)\,V(x_{1:T},y_{1:T},z_{1:T}) .$$

A solution to this problem, by means of the cross-entropy method, is 
proposed in the next section. 
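
To make the structure of h concrete, here is a minimal Python sketch of one 
time step of such a policy: the memories are drawn from the top level down 
to level 1, then the action is drawn. The nested-dictionary encoding of the 
laws, the use of `None` for the empty symbol and the function names are 
assumptions made for this illustration; they are not the author's 
implementation. 

```python
import random

def draw(law, rng):
    """Sample a key of the dict `law` with probability law[key]."""
    keys = list(law)
    return rng.choices(keys, weights=[law[k] for k in keys])[0]

def hhmm_step(h, n_levels, prev_mem, prev_obs, rng):
    """One time step of the policy h: returns (x_t, m_t).

    h[0][m1]               : law of x_t given m^1_t
    h[1][(y_prev, m2)]     : law of m^1_t given y_{t-1} and m^2_t
    h[lvl][(m_low, m_up)]  : law of m^lvl_t given m^{lvl-1}_{t-1} and m^{lvl+1}_t
    prev_mem               : dict level -> m^level_{t-1}, or None at t = 1
    None encodes the empty symbol (an index outside of its scope).
    """
    mem = {n_levels + 1: None}           # nothing above the top level
    for lvl in range(n_levels, 1, -1):   # top level first
        low = prev_mem[lvl - 1] if prev_mem else None
        mem[lvl] = draw(h[lvl][(low, mem[lvl + 1])], rng)
    mem[1] = draw(h[1][(prev_obs, mem[2])], rng)
    x = draw(h[0][mem[1]], rng)
    del mem[n_levels + 1]                # drop the placeholder entry
    return x, mem
```

With `n_levels = 1` the loop is empty and the drawn memory only depends on 
the last observation, which matches the memoryless case $\Lambda = 1$ 
described above. 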



4 Cross-entropic optimization of h 

The reader interested in CE methods should refer to the tutorial on the CE 
method. CE algorithms were first dedicated to estimating the probability 
of rare events. A slight change of the basic algorithm made it also good 
for optimization. In their recent article, Homem-de-Mello and Rubinstein 
have given some results about the global convergence. In order to ensure 
such convergence, some refinements are introduced, particularly about the 
selective rate. 

This presentation is restricted to the CE optimization method. The new 
improvements of the CE algorithm proposed in that article have not been 
implemented, but the algorithm has been seen to work properly. For this 
reason, this paper does not deal with the choice of the selective rate. 

4.1 General CE algorithm for the optimization 

The Cross Entropy algorithm repeats until convergence the three successive 
phases: 

1. Generate samples of random data according to a parameterized ran- 
dom mechanism, 

2. Select the best samples according to an evaluation criterion, 

3. Update the parameters of the random mechanism, on the basis of the 
selected samples. 

In the particular case of CE, the update in phase 3 is obtained by minimizing 
the Kullback-Leibler distance, or cross entropy, between the updated random 
mechanism and the selected samples. The next paragraphs describe on 
a theoretical example how such method can be used in an optimization 
problem. 

Formalism. Let $x \mapsto f(x)$ be a given function, which is easily 
computable. The value f(x) has to be maximized, by optimizing the choice 
of $x \in X$. The function f will be the evaluation criterion. 

Now let a family of probabilistic laws, $P_\sigma|_{\sigma\in\Sigma}$, 
applying on the variable x, be given. The family P is the parameterized 
random mechanism. The variable x is the random data. 

Let $\rho \in\, ]0,1[$ be a selective rate. The CE algorithm for (x, f, P) 
follows the synopsis: 

1. Initialize $\sigma \in \Sigma$, 

2. Generate N samples $x_n$ according to $P_\sigma$, 

3. Select the $\rho N$ best samples according to the evaluation criterion f, 

4. Update $\sigma$ as a minimizer of the cross-entropy with the selected 
samples: 

$$\sigma \in \arg\max_{\sigma\in\Sigma} \sum_{n \text{ selected}} \ln P_\sigma(x_n) ,$$

5. Repeat from step 2 until convergence. 

This algorithm requires f to be easily computable. 

Interpretation. The CE algorithm tightens the law around the maximizer 
of f. Then, when the probabilistic family P is well suited to the 
maximization of f, it becomes equivalent to find a maximizer for f or to 
optimize the parameter $\sigma$ by means of the CE algorithm. The problem 
is to find a good family. Another issue is the criterion for deciding the 
convergence. Some answers are given in the CE literature, but it is outside 
the scope of this paper to investigate these questions precisely. Our 
criterion was to stop after a given threshold of successive unsuccessful 
tries, and this very simple method has worked fine on our problem. 
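
The generic CE synopsis can be illustrated on a toy problem. The sketch 
below is an assumption-laden example, not the paper's implementation: the 
family $P_\sigma$ is a product of independent Bernoulli laws over bit 
strings, for which the cross-entropy update reduces to the empirical 
frequencies of the selected samples, and a fixed number of iterations 
replaces the stopping criterion discussed above. 

```python
import random

def cross_entropy_maximize(f, n_bits, n_samples=200, rho=0.1, n_iter=30, seed=0):
    """Maximize f over bit strings with independent Bernoulli laws P_sigma."""
    rng = random.Random(seed)
    sigma = [0.5] * n_bits                   # step 1: flat initial law
    n_elite = max(1, int(rho * n_samples))   # number of selected samples
    for _ in range(n_iter):
        # Step 2: generate N samples according to P_sigma.
        samples = [
            [int(rng.random() < s) for s in sigma] for _ in range(n_samples)
        ]
        # Step 3: select the rho * N best samples according to f.
        elite = sorted(samples, key=f, reverse=True)[:n_elite]
        # Step 4: maximum-likelihood update, ie. the frequency of each bit
        # among the selected samples.
        sigma = [sum(x[i] for x in elite) / n_elite for i in range(n_bits)]
    return sigma
```

For instance, with `f` counting the bits that agree with a hidden target 
pattern, the law tightens around that pattern after a few iterations. Note 
that a parameter which reaches 0 or 1 can no longer move; practical 
implementations often smooth the update to avoid this, a refinement which 
is not shown here. 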

4.2 Application 

Optimizing $h \in \mathcal{H}$ means tuning the parameter h in order to 
tighten the probability P[h] around the optimal values for V. This is 
exactly what the Cross-Entropy optimization method solves. However, it is 
required that the evaluation function V is easily computable. Typically, 
the definition of V may be recursive, eg.: 

$$V(x_{1:T},y_{1:T},z_{1:T}) = \{v_t(x_t,y_t,z_t,\}_{t=T}^{2}\; v_1(x_1,y_1,z_1)\;\{)\}_{t=2}^{T} .$$

Let the selective rate $\rho$ be a positive number such that $\rho < 1$. 
The cross-entropy method for optimizing h follows the synopsis: 

1. Initialize h. For example a flat h, 

2. Draw N samples $\theta^n = (x^n, y^n, z^n, m^n)$ according to the law 
P[h], 

3. Choose the $\rho N$ best samples $\theta^n$ according to the sample 
evaluation $V(x^n_{1:T}, y^n_{1:T}, z^n_{1:T})$. Denote S the set of the 
selected samples, 

4. Update h as the minimizer of the cross-entropy with the selected 
samples: 

$$h \in \arg\max_{h\in\mathcal{H}} \sum_{n\in S} \ln P[h](\theta^n) , \qquad (3)$$

5. Reiterate from step 2 until convergence. 
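For our HHMM model, the maximization (3) is solved by: 

$$h^0(A|B) = \frac{\mathrm{card}\{n\in S, t \;/\; A = x_t^n \text{ and } B = m_t^{1,n}\}}{\mathrm{card}\{n\in S, t \;/\; B = m_t^{1,n}\}} ,$$

$$h^1(A|B,C) = \frac{\mathrm{card}\{n\in S, t \;/\; A = m_t^{1,n} ,\; B = y_{t-1}^n \text{ and } C = m_t^{2,n}\}}{\mathrm{card}\{n\in S, t \;/\; B = y_{t-1}^n \text{ and } C = m_t^{2,n}\}} ,$$

and for $2 \le \lambda \le \Lambda$: 

$$h^\lambda(A|B,C) = \frac{\mathrm{card}\{n\in S, t \;/\; A = m_t^{\lambda,n} ,\; B = m_{t-1}^{\lambda-1,n} \text{ and } C = m_t^{\lambda+1,n}\}}{\mathrm{card}\{n\in S, t \;/\; B = m_{t-1}^{\lambda-1,n} \text{ and } C = m_t^{\lambda+1,n}\}} .$$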
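
These updates are ratios of occurrence counts over the selected samples. 
A minimal Python sketch of the counting is given below for the output law 
$h^0$; the helper names and the representation of a sample as a pair of 
lists (actions, level-1 memories) are assumptions made for the example. The 
other laws are obtained in the same way by changing the counted tuples. 

```python
from collections import Counter, defaultdict

def conditional_frequencies(pairs):
    """Estimate a law of A given B from (a, b) occurrences.

    Returns law[b][a] = card{A = a and B = b} / card{B = b}.
    """
    joint = Counter(pairs)
    marginal = Counter(b for _, b in pairs)
    law = defaultdict(dict)
    for (a, b), count in joint.items():
        law[b][a] = count / marginal[b]
    return dict(law)

def update_output_law(selected):
    """Update of h^0(x | m^1) from the selected samples.

    Each selected sample is a pair (xs, m1s) of equal-length lists holding
    the actions x_t and the level-1 memories m^1_t over time.
    """
    pairs = [(x, m1) for xs, m1s in selected for x, m1 in zip(xs, m1s)]
    return conditional_frequencies(pairs)
```

Conditioning values that never occur in the selected samples get no entry 
in the returned dictionary; how such cases are handled (for example by 
keeping the previous law) is left open here. 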



4.3 Bypassing the prior modelling 

POMDPs have a main drawback: they require the priors about the world 
and the mission evaluation to be defined. Defining the mission evaluation 
is not a difficult task in general. But it is quite difficult to define the 
HMM which models the world. For example, what is the structure (memory 
requirement, time/space hierarchies) of this HMM? What are the transition 
laws? In many works, the structure is defined by the designer of the 
system. But generally, the transition parameters are specified by learning 
from the actual world (eg. Baum-Welch algorithm). In the end, two steps are 
implemented: 

• Learn beforehand the parameters of the HMM associated to the actual 
world (Baum-Welch), 

• Solve the POMDP for the instant play. 

This is what is generally done when using the dynamic programming 
viewpoint. But there is something nice with the cross-entropic methods. 
Since the algorithm just needs samples, the HMM is not required. More 
precisely, since 
$P[h] = P(y_{1:T},z_{1:T}|x_{1:T})\,h(x_{1:T},m_{1:T}|y_{1:T})$, the 
update (3) may be factorized and the following simple result is derived: 

$$h \in \arg\max_{h\in\mathcal{H}} \sum_{n\in S} \ln h(x_{1:T}^n,m_{1:T}^n|y_{1:T}^n) .$$

What is needed is a stochastic process which, in association with the 
policy h, will build the samples. This process may just be the actual world 
itself! Then, optimizing h only requires time for experimental 
implementations: 

1. Initialize h, 

2. N samples $\theta^n = (x^n, y^n, m^n)$ and their respective evaluations 
are obtained by repeating N times the following procedure: 

(a) Initialize t = 1, 

(b) Make a toss of $x_t^n$ and $m_t^n$ according to the law 
$h(x_t, m_t|m_{t-1}^n, y_{t-1}^n)$, 

(c) Execute the action $x_t^n$ in the actual world, 

(d) Make the actual measurement $y_t^n$, 

(e) Set t = t + 1, 

(f) Repeat from step 2b until t > T, 

(g) Make an actual evaluation, denoted $V_n$, of the whole play (this 
evaluation depends on the actions, on the observations and on the actual 
hidden states), 

3. Choose the $\rho N$ best samples $\theta^n$ according to the sample 
evaluation $V_n$. Denote S the set of the selected samples, 

4. Update h as the minimizer of the cross-entropy with the selected 
samples: 

$$h \in \arg\max_{h\in\mathcal{H}} \sum_{n\in S} \ln h(x_{1:T}^n,m_{1:T}^n|y_{1:T}^n) ,$$

5. Reiterate from step 2 until convergence. 

This method avoids any construction of the world prior P(y, z|x), but it 
requires many experimentations on the actual world. This is sometimes 
unworkable. 

The next section presents an example of implementation of the algorithm 
described in section 4.2. 

5 Implementation 



The algorithm has been applied to a simplified target detection problem. At 
this time, the open-loop implementation has not been investigated yet. 

5.1 Problem setting 

A target R is moving in a lattice of 20 x 20 cells, ie. [0, 19 p. R is tracked 
by two mobiles, B and C, controlled by the planner. The coordinate of R, 
B and C at time t are respectively denoted ijj), (^B'J'b) ^'^'^ (^C'-^c)- 
B and C have a very limited information about the target position, and are 
maneuvering much slower: 

• A move for B (respectively C) is either: turn left, turn right, go forward, 
no move. Consequently, there are 4x4 = 16 possible actions for the planner. 
These moves cannot be combined in a single turn. No diagonal forward: a 
mobile is either directed up, right, down or left, 

• The mobiles are initially positioned in the down corners, ie. i^ = 0, 
jj^ = 19 and = 19, j'^ = 19. The mobile are initially directed downward, 

• B (respectively C) observes whether the target relative position is forward
or not. More precisely:

o when B is directed upward, the mobile knows whether $j_R < j_B$ or not,
o when B is directed right, the mobile knows whether $i_R > i_B$ or not,
o when B is directed downward, the mobile knows whether $j_R > j_B$ or
not,

o when B is directed left, the mobile knows whether $i_R < i_B$ or not,

• B (respectively C) knows whether its distance with the target is less than
3, ie. $d_\infty(B,R) < 3$, or not. The distance $d_\infty$ is defined by:

$d_\infty(B, R) = \max\{|i_B - i_R|, |j_B - j_R|\}$.

At last, there are $2^4 = 16$ possible observations for the planner (two
binary observations for each of the two mobiles).
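The observation rules above may be sketched as follows. The function
names, the string encoding of the headings and the packing of the four
booleans into an integer in [0, 15] are assumptions made for illustration;
only the rules themselves come from the text.

```python
def chebyshev(a, b):
    """Distance d_infinity between two cells given as (i, j) pairs."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def target_is_forward(mobile, heading, target):
    """True when the target is in front of the mobile, for its heading."""
    (i_m, j_m), (i_r, j_r) = mobile, target
    if heading == "up":
        return j_r < j_m
    if heading == "right":
        return i_r > i_m
    if heading == "down":
        return j_r > j_m
    if heading == "left":
        return i_r < i_m
    raise ValueError(f"unknown heading: {heading}")


def observe(mobile, heading, target, radius=3):
    """Two binary observations of one mobile, packed into 0..3."""
    forward = target_is_forward(mobile, heading, target)
    near = chebyshev(mobile, target) < radius
    return 2 * int(forward) + int(near)


def joint_observation(obs_b, obs_c):
    """Observation of the planner: one value among 16."""
    return 4 * obs_b + obs_c
```

For instance, a mobile in (0, 19) directed downward, with the target in
(10, 10), sees neither a forward target nor a near target, so that
observe((0, 19), "down", (10, 10)) is 0.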

Several test cases have been considered. In case 1, the target R does not
move. In any other case, the target R chooses stochastically its next position
in its neighborhood. Any move is possible (up/down, left/right, diagonals,
no move). The probability to choose a new position is proportional to the
sum of the squared distances from the mobiles:

$P(R^{t+1}|R^t) = 0$ if $|i_R^{t+1} - i_R^t| > 1$ or $|j_R^{t+1} - j_R^t| > 1$,

$P(R^{t+1}|R^t) \propto (i_R^{t+1} - i_B^t)^2 + (j_R^{t+1} - j_B^t)^2
+ (i_R^{t+1} - i_C^t)^2 + (j_R^{t+1} - j_C^t)^2$ else.

This definition was intended to favor escape moves: the greater the
distance, the more probable the move. But in such a summation, a short
distance is neglected compared to a long distance (this was initially a
mistake in the modelling!). It follows that a distant mobile will hide a
nearby mobile. This "deluding" property is interesting and actually induces
two different kinds of strategy within the learned machines.
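The moving rule of the target and its "deluding" property may be
illustrated by the following sketch. It assumes that the positions of the
mobiles are those of the current time, and that moves leaving the lattice
are simply excluded; neither point is stated explicitly in the text.

```python
def target_move_distribution(target, mobile_b, mobile_c, size=20):
    """Law of the next target cell, given the positions of the mobiles."""
    i_r, j_r = target
    weights = {}
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            i, j = i_r + di, j_r + dj
            # Assumption: cells outside the lattice cannot be chosen.
            if 0 <= i < size and 0 <= j < size:
                # Sum of the squared distances from the two mobiles.
                weights[(i, j)] = sum(
                    (i - i_m) ** 2 + (j - j_m) ** 2
                    for i_m, j_m in (mobile_b, mobile_c)
                )
    total = sum(weights.values())
    return {cell: w / total for cell, w in weights.items()}
```

With the target in (10, 10), a nearby mobile in (10, 11) and a distant
mobile in (10, 0), the cell (10, 11) has weight 0 + 121 = 121 whereas the
cell (10, 9) has weight 4 + 81 = 85: the target is more likely to move
towards the nearby mobile, because the distant mobile dominates the sum.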

The objective of the planner is to keep the target sufficiently close to
at least one mobile (in this example, the distance between the target and a
mobile is required to be not more than 3). More precisely, the evaluation
function, V, just counts the number of such "encounters":

$V_0 = 0$ ; $V_t = V_{t-1} + 1$ if $d(B^t, R^t) \le 3$ or $d(C^t, R^t) \le 3$ ;
$V_t = V_{t-1}$ else.

The total number of turns is T = 100. 
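A minimal sketch of this evaluation is given below. It assumes that d is
the distance $d_\infty$ used for the observations, and that the
trajectories are given as lists of (i, j) cells, one per turn; the
function names are hypothetical.

```python
def chebyshev(a, b):
    """Distance d_infinity between two cells given as (i, j) pairs."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def evaluate(targets, mobiles_b, mobiles_c, radius=3):
    """Count the turns where at least one mobile is within the radius."""
    value = 0
    for r, b, c in zip(targets, mobiles_b, mobiles_c):
        if chebyshev(b, r) <= radius or chebyshev(c, r) <= radius:
            value += 1
    return value
```

The returned value is bounded by the number of turns, ie. by T = 100 for
the plays considered here.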

5.2 Results 

Generality. Like many stochastic algorithms, this algorithm needs some
time for convergence. For the considered example, about two hours were
needed for convergence (on a 2GHz PC); the selective rate was $\rho$ = 0.5.
This speed depends on the size of the HHMM model and on the convergence
criterion. A weak and a strong criterion are used for deciding the
convergence. With the weak criterion, the algorithm is halted after 100
successive unsuccessful tries. With the strong criterion, the algorithm is
halted after 500 successive unsuccessful tries. Of course, the strong
criterion computes a (slightly) better optimum than the weak criterion, but
it needs more time. Because of the many tested examples, the weak criterion
has been the most used, in particular for the big models. For the same
HHMM model, the computed optimal values do not depend on the algorithmic
instance (small variations result however from the stochastic nature of the
algorithm): the convergence seems to be global.
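The two convergence criteria may be read as a "patience" rule, sketched
below. This is only one plausible reading: it is assumed here that a try
is one iteration of the algorithm, and that a try is unsuccessful when it
does not improve the best evaluation found so far. The function name and
its arguments are hypothetical.

```python
def run_until_stalled(step, patience=100):
    """Call step() until `patience` successive tries bring no improvement.

    `step` is any callable returning the evaluation reached by one try.
    Use patience=100 for the weak criterion, 500 for the strong one.
    """
    best = float("-inf")
    failures = 0
    tries = 0
    while failures < patience:
        value = step()
        tries += 1
        if value > best:
            best, failures = value, 0
        else:
            failures += 1
    return best, tries
```

Under this reading, the strong criterion costs at least 400 more tries
than the weak criterion before halting, which is consistent with the
additional time it requires.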



Figure 10: Near-optimal control sequence
(× = target, • = observer 1, ○ = observer 2; relative times are put in
superscript)

In the sequel, mean evaluations are rounded to the nearest integer. This
makes the presentation clearer and, owing to the small variations of this
stochastic algorithm, more precision would be irrelevant.

Case 1: R does not move. This example has been considered in order
to test the algorithm. The position of the target is fixed in the center of the
square space, ie. $i_R = j_R = 10$. It is recalled that the mobiles are
initially directed downward. Then, the optimal strategy is known and its
value is 85: the time needed to reach the target is 15, and no further move
is needed. The learned law h reaches the evaluation 84. The convergence is
good.

Case 2: R is moving but the observation y is hidden. Initially, R
is located within the 20 × 10 upper cells of the lattice (ie. [0, 19] × [0, 9]),
according to a uniform probabilistic law. The computed optimal mean
evaluation is about 32. In this case, the mobiles tend to move towards the
upper corners.

Case 3: R is moving and y is observed. Again, R is located uniformly
within the 20 × 10 upper cells of the lattice. The computed optimal mean
evaluation is about 69. This evaluation has been obtained from a large
HHMM model (A = 2 with 256 states per level, ie. card($M^1$) = card($M^2$)
= 256) and with the strong criterion. However, somewhat smaller models
should work as well.

Specific computations are now presented, depending on the number of levels
A and the number of states per level. For each case, the weak criterion has
been used.

Subcase A = 1. For such a model, the action $x_t$ is constructed only from
the immediate last observation $y_{t-1}$. The model does not keep any
memory of the past observations. Then, 16 states are sufficient to describe
the hidden variable $m_t^1$, ie. card($M^1$) = 16. The resulting optimum
is 54.

Subcase A = 2. This model is equivalent to a HMM and it is assumed that
card($M^1$) = card($M^2$). The following table gives the computed optimum
for several choices of the set of states:



card($M^1$) = card($M^2$)    16    32    64    256
Evaluation                   65    66    67     67



It is noteworthy that the memory of the past observations allows better
strategies than just immediate observations (case A = 1). Indeed, the
evaluation jumps from 54 up to 67.

Subcases A > 2. A comparison of graduated hierarchic models, 1 ≤ A ≤ 4,
has been made. The first level contained 16 possible states, and the higher
levels were restricted to 2 states:



hierarchic grade                 A = 1    A = 2    A = 3     A = 4
card($M^1$), ..., card($M^A$)    16       16,2     16,2,2    16,2,2,2



The test has been accomplished according to the weak criterion: 



hierarchic grade     A = 1    A = 2    A = 3    A = 4
Evaluation (weak)    54       59       56       65



and the strong criterion: 



hierarchic grade       A = 1    A = 2    A = 3    A = 4
Evaluation (strong)    55       61       64       66

It seems that a high hierarchic grade makes the convergence more difficult.
This is particularly the case here for the grade A = 3, which stalled at
only 56 under the weak criterion. However, the algorithm still works well
when the convergence criterion is strengthened.

It is interesting to make a comparison with the subcase A = 2 where
card($M^1$) = card($M^2$) = 16. Under the weak criterion, the result for
this HHMM was 65, as for the grade A = 4. However, the dimension of the law
is quite different for the two models:

• 15 × 16 + 15 × 16 × 16 + 15 × 16 = 4320 for the 2-level HHMM,

• 15 × 16 + 15 × 16 × 2 + 1 × 16 × 2 + 1 × 2 × 2 + 1 × 2 = 758 for the
4-level HHMM.
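The two dimensions can be checked with a few lines of Python. The terms
are copied from the two sums above; the variable and function names are
only illustrative, and no general formula for the dimension of a HHMM law
is claimed here.

```python
import math

# Each tuple is one product term of the sums given in the text.
TWO_LEVEL_TERMS = [(15, 16), (15, 16, 16), (15, 16)]
FOUR_LEVEL_TERMS = [(15, 16), (15, 16, 2), (1, 16, 2), (1, 2, 2), (1, 2)]


def dimension(terms):
    """Sum of the products of the listed terms."""
    return sum(math.prod(term) for term in terms)


two_level = dimension(TWO_LEVEL_TERMS)
four_level = dimension(FOUR_LEVEL_TERMS)
ratio = two_level / four_level
```

The 2-level law thus has between five and six times as many free
parameters as the 4-level law, for the same evaluation under the weak
criterion.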

This dimension is a rough characterization of the complexity of the model.
These examples suggest that highly hierarchized models are more efficient
than weakly hierarchized models: the 4-level HHMM reaches the same
evaluation with far fewer parameters. And the problem considered here is
quite simple. On complex problems, hierarchical models may be pre-eminent.

Global behavior. 

The algorithm. The convergence speed is low at the beginning. After this
initial stage, it improves greatly until it reaches a new "waiting" stage. This
alternation of low-speed and high-speed stages has been observed several
times before an acceptable convergence. Nevertheless, the speed is globally
decreasing with time.

The near optimal policy. The behaviour of the best found policy is now
discussed. This HHMM has reached the mean evaluation 69. The strategy of
the mobiles results in a tracking of the target. Figure 10 illustrates a
short sequence of escape/tracking of the target. Two quite distinct
behaviours have been noticed among the many runs of the policy:

• The two mobiles may both cooperate on tracking the target, 

• When the target is near the border, one mobile may stay along the
opposite border while the other mobile performs the tracking. This
strategy seems strange at first sight. But it is recalled that the moving
rule of the target tends to neglect a nearby mobile compared to a
distant mobile. In this strategy, the first mobile just annihilates
the ability of the target to escape from the tracking of the second
mobile.

6 Conclusion 

In this paper, we proposed a general method for approximating the optimal
planning in a partially observable control problem. Hierarchical HMM
families have been used for approximating the optimal decision tree, and
the approximation has been optimized by means of the Cross-Entropy method.
Moreover, hierarchical HMM has been characterized as a multilevel HMM
with alternately up-and-down (Markovian) diffusions of the information
between the levels. The algorithm has been applied on a simpler model based
on a weaker diffusion of the information.

At this time, the method has been applied to a strictly discrete-state
problem and has been seen to work properly. The convergence seems global:
the many runs of the algorithm have reached approximately the same optimal
values. The optimal HHMMs are able to "track" the target. An interesting
point is that these HHMMs have discovered two quite different global
strategies and are able to choose between them: make the mobiles both
cooperate on tracking or require one mobile for deluding the target.

The results are promising, but the test example is still simple. The
observation and action spaces are limited to a small number of states. And
what happens if the hidden space becomes much more intricate? There are
several possible answers to such difficulties:

First, it has been shown that it is not necessary to define the priors of the
problem (defining priors often implies many approximations): the CE
algorithm is able to learn the policy directly from the actual world.
Moreover, the cross-entropic principle could be applied for optimizing
continuous laws. It is thus certainly possible to consider mixed
continuous/discrete HHMMs, which are more realistic for a planning policy.
At last, many refinements are foreseeable about the structure of the HHMMs.
More precisely, hierarchic models for observation, action and memory
(hierarchical HMMs, but also hierarchical Markov Random Fields) should be
improved in order to locally factorize intricate problems. This research is
just preliminary and future works should investigate these questions.

A Murphy and Paskin Bayesian Network

Definition. A HHMM is a HMM whose output is either an observation
(when reaching the output level) or a HHMM. A formal definition is given
here, for readers acquainted with such HMM formalism, but it is not
explained in detail.

Formalism. A HHMM with A levels is characterized by an observation
set $Q_1$ and by $A - 1$ quadruplets $(Q_d, E_d, p_d, \pi_d)_{2 \le d \le A}$.
For each level $d \in [2, A]$, the quadruplet $(Q_d, E_d, p_d, \pi_d)$ is
related to a HMM for the level d, thus verifying:

• $Q_d$ is a set of states for the level d,

• $E_d \subset Q_d$ is a set of ending states (may be empty for the root A).
When such a state is obtained, the HMM stops and jumps back to the
HMM of higher level; for the root level, the HHMM just ends,

• $p_d(q_{d-1,t}, q_{d,t}) = P(q_{d-1,t}|q_{d,t})$ describes the probability
for the HMM at level d to produce (at time t) the output
$q_{d-1,t} \in Q_{d-1}$ when the inner state is $q_{d,t} \in Q_d$. When
d = 2, the output is just an observation produced by the HHMM. When d > 2,
the HMM of level d initiates the child HMM of level d - 1 with the starting
state $q_{d-1,t}$, and waits until the child HMM reaches an ending state,

• $\pi_d(q_{d,t}, q_{d,t+1}) = P(q_{d,t+1}|q_{d,t})$ describes the
probability for the HMM at level d to transit from a state
$q_{d,t} \in Q_d \setminus E_d$ at time t to the state
$q_{d,t+1} \in Q_d$ at time t + 1. This transit is run after the production
stage.

In this definition, $\pi_A(\emptyset, q_{A,1}) = P(q_{A,1})$ initializes the
HHMM at time 1.

Related Bayesian Network. Murphy and Paskin [7] proposed an alternate
definition of the HHMM by means of a Bayesian Network. This BN is in
fact a vectorized HMM, which has the Markov property both in time and in
hierarchical level. This BN relies on 2 types of cells, state cells and
boolean cells, each drawn with its own symbol in the figures. More
precisely, the figures distinguish:

• an inner state cell (symbol •),

• an output state cell (symbol V),

• an unspecified boolean cell,

• a boolean cell specified FALSE,

• a boolean cell specified TRUE.

Figure 11: A Hierarchical HMM (4 levels)

The structure of the Bayesian Network is given by figure 11. The boolean
cells are dedicated to the control of the state transition. A FALSE cell
indicates that the bottom level is running. A TRUE cell indicates that the
bottom level is ended. Depending on the boolean configuration, a HMM
transit $\pi$, a HMM production p or a wait (the states are left unchanged
by the identity id) is chosen. All the state transitions are summarized in
tables 1 and 2. For example, when a child level is running, the father
level is waiting and the father row is between two FALSE cells: the
transition is id and the father state is left unchanged.

On the other hand, table 3 summarizes the boolean transitions. A
boolean cell may just be a copy of the bottom boolean cell, when this cell
is FALSE. Indeed, that means that the child level is running, and
consequently the father level is running, although in a waiting mode.
Otherwise, the transition is controlled by the HMM transit $\pi$.



Bayesian Network equivalence. It is now shown that the BN of figure 12
is a particular case of the BN of figure 13. The simple hierarchical
model of figure 7 is thus sufficient to describe a general HHMM.

Table 1: Transition rule: for a state cell

Table 2: Transition rule: for a state cell (border cases)
(graphical table: each cell configuration results in id, $\pi$ or p)

Table 3: Transition rule: for a boolean cell
(graphical table: each cell configuration results in a boolean cell)

The BN of Murphy is really not far from figure 7. It is just needed to
distort the arrows of figure 11 and to add intermediate state cells and
intermediate boolean cells. The intermediate Bayesian Network of figure 12
is then obtained, and is clearly equivalent to the BN of Murphy. By fusing
the neighbouring cells into the cells o and redefining the transition rules
properly, the BN of figure 13 results, which is of the same kind as in
figure 3. Notice that the cells o possibly contain two booleans and an
information.

Figure 12: Modified Hierarchical HMM

Figure 13: Simplified Model of a HHMM